A panel regression analysis for the COVID-19 epidemic in the United States

This study explored the roles of epidemic-spread-related behaviors, vaccination status and weather factors during the COVID-19 epidemic in the 50 U.S. states. Data from March 1, 2020 to February 5, 2022 were incorporated into a panel model, and the states were clustered by the k-means method. In addition to analyzing the whole period, we took multiple event nodes into account and analyzed the data in separate time segments by panel linear regression. The influence of cluster grouping and of different incubation periods was also examined. The non-segmented analysis showed that the rate of people staying at home and the vaccination dose per capita were significantly negatively correlated with the daily incidence rate, while the number of long-distance trips was positively correlated with it. Weather indicators also had a negative effect to a certain extent. Most segmented results supported these findings. The vaccination dose per capita was the most significant factor, especially for the epidemic dominated by the Omicron strain. A 7-day incubation period was the most robust choice and gave the best model fit, while weather had different effects on epidemic spread in different time segments. The implementation of prevention behaviors and the promotion of vaccination may help control COVID-19, including epidemics caused by variants such as Omicron. The spread of COVID-19 might also be associated with weather, albeit to a lesser extent.


Introduction
The rapid spread of COVID-19 has seriously affected people's health and daily life and has imposed a great burden on almost every country [1]. The COVID-19 epidemic started in December 2019 and quickly swept the world. At the beginning of 2020, cases in the U.S. were only sporadic [2,3]. However, in the early days of the epidemic, heated debate over 'wearing masks' and 'freedom and human rights' in American society, together with residents' limited adoption of prevention measures, led to uncontrolled spread of the epidemic [4]. Until December 14, 2020, when the vaccine officially began to be universally

Data sources
In this study, the 50 U.S. states were the research subjects. The daily incidence rate (DIR) of each state from March 1, 2020 to February 5, 2022, taken from the data released by Johns Hopkins University [5], was the dependent variable. The independent variables included the daily proportion of residents staying at home (AHR) and the number of daily trips, both obtained from the website of the Bureau of Transportation Statistics; a trip is defined as a movement that includes a stay of longer than 10 minutes at an anonymized location away from home [29]. Daily trips per capita (TR) equals the number of trips divided by the population. We also obtained the number of medium-distance trips per capita, TR (>25 miles), and of long-distance/interstate trips per capita, TR (>250 miles). Vaccination status was expressed as the daily administered vaccination dose per capita (AVD), obtained from the CDC [30]. Missing values were filled by linear interpolation. Weather factors, including daily temperature (T), humidity (H), wind speed (WS), air pressure (AP) and precipitation (PPTN) for every state, were collected from the Weather Underground website [31].
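The per-capita scaling and the linear interpolation of missing values can be illustrated with a short sketch. It is not the authors' code: the function names, the toy dose counts and the population figure are assumptions chosen for illustration, and only the Python standard library is used.

```python
def per_capita(counts, population):
    """Divide each daily count by the population; missing days stay None."""
    return [None if c is None else c / population for c in counts]


def interpolate_missing(values):
    """Fill None gaps by straight-line interpolation between known neighbours.

    Gaps before the first or after the last known value are left as None,
    because there is no second endpoint to interpolate towards.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled


# Hypothetical daily administered doses for one state, with two missing days.
daily_doses = [1000, None, None, 4000, 5000]
population = 1_000_000  # hypothetical population

avd = interpolate_missing(per_capita(daily_doses, population))
print(avd)
```

The two missing days are replaced by values on the straight line between the neighbouring observed days, which is what linear interpolation of a daily series amounts to.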

Preprocessing
COVID-19 has an incubation period before onset. Previous studies found that the lag between exposure to the coronavirus and its effect is about 5-7 days, or even longer [32][33][34]. The dynamic public health surveillance of COVID-19 in the U.S. conducted by Dr. Post [35] suggested that the coefficients on the 7-day lag were both positive and statistically significant. We therefore chose 7 days as the incubation period when preprocessing the data, meaning that each independent variable was matched with the DIR observed 7 days later. Because SARS-CoV-2 variants may have a shorter incubation period, we also used 3 days and 10 days as the incubation period in an uncertainty analysis.
In addition to analyzing the data from March 1, 2020 to February 5, 2022 as a whole, we divided the period into 6 segments according to the time of quarantine policy introduction, the time of the first vaccination, the time when each mutant strain became prevalent in the U.S., etc., and explored the influencing factors in each segment. The segmentation was based on the following events. At the end of March 2020, almost every state required its residents to follow statewide stay-at-home orders [36,37]. On July 4, 2020, almost the entire country opened with virtually no restrictions [38]. In late 2020, U.S. residents began to be vaccinated, and this number was recorded by the CDC from December 12, 2020 [30]. The Delta variant was first detected in the U.S. in March 2021 [13]. The first U.S. case of COVID-19 caused by the Omicron variant was reported on December 1, 2021 [39]. The specific segmentation is shown in Fig 1.
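Matching each day's independent variables with the DIR observed a fixed number of days later can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the record layout, the field names and the toy values are hypothetical.

```python
from datetime import date, timedelta


def align_with_lag(records, lag_days):
    """Pair each state-day's predictors with the DIR observed lag_days later.

    records: list of dicts with keys "state", "day", "ahr" and "dir".
    Rows whose lagged day falls outside the data are dropped.
    """
    dir_lookup = {(r["state"], r["day"]): r["dir"] for r in records}
    aligned = []
    for r in records:
        target_day = r["day"] + timedelta(days=lag_days)
        future_dir = dir_lookup.get((r["state"], target_day))
        if future_dir is not None:
            aligned.append(
                {"state": r["state"], "day": r["day"],
                 "ahr": r["ahr"], "dir_lagged": future_dir}
            )
    return aligned


# Toy panel: one state, ten consecutive days, DIR equal to the day index.
start = date(2020, 3, 1)
records = [
    {"state": "AK", "day": start + timedelta(days=i), "ahr": 0.3, "dir": float(i)}
    for i in range(10)
]

aligned = align_with_lag(records, lag_days=7)
print(len(aligned), aligned[0])
```

Changing `lag_days` to 3 or 10 gives the alternative alignments used in the uncertainty analysis. The lookup is keyed by state, so one state's predictors are never paired with another state's DIR.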

Panel data model
Panel data are two-dimensional data that combine a cross-sectional (spatial) dimension with a time dimension. They can be understood as the data formed by recording certain characteristic values of i objects at t different time points [40], so a panel variable can be written with a double subscript, $y_{it}$. This study used a panel data model to fit the DIR of the 50 U.S. states and considered the development of COVID-19 along both the vertical (time) dimension and the horizontal (state) dimension. Through cluster analysis and multiple linear regression of the panel data, the spatial and temporal characteristics of the epidemic can be explored separately.

Statistical analysis
First, cluster analysis of the panel data was conducted with the classic k-means algorithm. The data in this study can be expressed as an n×d matrix X, where n is the number of samples (n = 35350 in our study) and d is the dimension of the samples (d = 9 in our study). The k cluster centers are expressed as a k×d matrix C, where k = 3 and each row of C represents a cluster center. The distances from the samples to the k centers are expressed as an n×k matrix D.
According to the optimization problem (1), each sample point is assigned to its nearest class center (2), forming k classes, and the sample mean of each class is then taken as the updated class center. The class centers are updated iteratively until they remain stable.
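The assign-and-update iteration described above can be sketched in plain Python. This is a minimal sketch of the standard k-means (Lloyd) iteration, not the authors' implementation: the function names and the toy two-dimensional samples are assumptions, and the initial centers are passed in explicitly so that the example is reproducible (in practice they are usually chosen at random).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(samples, initial_centers, max_iter=100):
    """Iterate assignment and mean update until the centers stop changing."""
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # n x k distance matrix D: one row per sample, one column per center.
        distances = [[squared_distance(s, c) for c in centers] for s in samples]
        # Assignment step: each sample joins its nearest center.
        labels = [row.index(min(row)) for row in distances]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j, old_center in enumerate(centers):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old_center)  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Toy data: three well-separated groups of three points each (d = 2, k = 3).
samples = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
labels, centers = kmeans(samples, [samples[0], samples[3], samples[6]])
print(labels)
print(centers)
```

In the study each sample would be one state-day with d = 9 features, and the result depends on the initial centers, so several initializations are normally compared.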
Each state was assigned to a group according to the cluster in which it spent the largest number of days during the study period, and the groups were visualized accordingly. For example, Alaska (AK) spent most of its days in cluster 3, so we classified it into the third category. Each of the three categories was then analyzed separately to test the impact of the clustering results and to explore the effect of the factors among similar states.
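The rule of placing each state in the cluster where it spends the most days is a majority vote over that state's daily labels. The sketch below illustrates it under assumptions: the function name, the second state identifier and the daily labels are hypothetical, not taken from the study's data.

```python
from collections import Counter, defaultdict


def assign_states(daily_labels):
    """Map each state to the cluster it belongs to on the largest number of days.

    daily_labels: iterable of (state, cluster) pairs, one pair per state-day.
    If two clusters tie, the one encountered first for that state is kept.
    """
    counts = defaultdict(Counter)
    for state, cluster in daily_labels:
        counts[state][cluster] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


# Hypothetical daily cluster labels for two states over five days.
daily_labels = [
    ("AK", 3), ("AK", 3), ("AK", 1), ("AK", 3), ("AK", 2),
    ("STATE_B", 1), ("STATE_B", 1), ("STATE_B", 2), ("STATE_B", 1), ("STATE_B", 1),
]
groups = assign_states(daily_labels)
print(groups)
```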
The Hausman test was used to choose between a random-effects and a fixed-effects model for the panel regression. The linear regression of the panel data model was fitted by ordinary least squares (OLS). In panel data, the disturbance terms of different individuals are independent of each other, but the disturbance terms of the same individual in different periods are often autocorrelated. We therefore used the robust command to run the regression with cluster-robust standard errors, which reduces the overestimation of the significance of the independent variables' influence on the dependent variable and gives a more accurate linear trend.
Analyses were carried out with Python-based code and Stata 16.0, with a significance level of α = 0.05.
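To make the combination of a fixed-effects fit and cluster-robust standard errors concrete, the sketch below works through the one-regressor case by hand. It is an illustration under assumptions, not the Stata procedure used in the study: the function name and the toy panel are hypothetical, only a single independent variable is handled, and the finite-sample correction that statistical packages usually apply to the clustered variance is omitted.

```python
from collections import defaultdict
from math import sqrt


def fixed_effects_fit(rows):
    """Within (fixed-effects) slope for one regressor with a clustered SE.

    rows: list of (unit, x, y) observations.
    Returns (slope, cluster_robust_se), clustering on the unit.
    """
    by_unit = defaultdict(list)
    for unit, x, y in rows:
        by_unit[unit].append((x, y))

    # Within transformation: subtracting each unit's own means removes the
    # unit-specific intercept (the fixed effect).
    demeaned = {}
    for unit, obs in by_unit.items():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        demeaned[unit] = [(x - mean_x, y - mean_y) for x, y in obs]

    sxx = sum(x * x for obs in demeaned.values() for x, _ in obs)
    sxy = sum(x * y for obs in demeaned.values() for x, y in obs)
    slope = sxy / sxx

    # Sandwich variance: x * residual is summed within each unit before
    # squaring, so correlation of errors inside a unit is allowed for.
    meat = 0.0
    for obs in demeaned.values():
        score = sum(x * (y - slope * x) for x, y in obs)
        meat += score ** 2
    return slope, sqrt(meat) / sxx


# Toy panel: two units with different intercepts, three periods each.
panel = [
    ("A", 0.0, 10.0), ("A", 1.0, 8.0), ("A", 2.0, 7.0),
    ("B", 0.0, 3.0), ("B", 1.0, 1.0), ("B", 2.0, -1.0),
]
slope, se = fixed_effects_fit(panel)
print(slope, se)
```

Because the unit means are removed first, the large difference in level between units A and B does not affect the slope. The study's models include several independent variables, which requires the matrix form of the same calculation.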

Cluster and basic situation
The independent variables above were used to cluster the 50 states, and the frequency distribution of the clusters is shown in Table 1.

Multivariate analysis
After the Hausman test, the fixed-effects model was selected for multivariate regression analysis. TR (>250 miles) was more stable across all models than TR or TR (>25 miles), so Table 2 shows the results of the models that include TR (>250 miles) as an independent variable. Model results involving TR or TR (>25 miles) are presented in the S2 and S3 Tables. According to the unsegmented results, AHR and AVD were significantly negatively correlated with DIR; the coefficients of T, WS, AP and PPTN were rather small but also negative. TR (>250 miles) had a significant positive effect on DIR (Table 2). The linear regression equation was written as:
The regression results of the three categories after clustering are also listed in Table 2 and were largely consistent with the unsegmented results. TR (>250 miles) had a stronger effect on the DIR of the first category, with a higher regression coefficient, while the second category was less affected by it; in the second category, however, vaccination strongly inhibited the increase of DIR (coefficient = -3.05E+00). The results of the third category were the closest to those of all 50 states, and the two models also had similar R-squared values.
In the first segment, AHR had a relatively significant effect on DIR, whereas the other independent variables, such as TR (>250 miles) and the weather indicators, had only a slight effect. TR (>250 miles) even showed an unexpected negative correlation with DIR, although it was not significant in the first and second categories. From the second segment onward, however, the relationship between TR (>250 miles) and DIR was in the expected direction. The effect of weather on DIR was weak; T and DIR showed a positive correlation, which differed from the unsegmented results.
In the second segment, the positive effect of TR (>250 miles) on DIR was much larger in magnitude than the effect of AHR (1.91E-01 > 2.72E-02). As in the first segment, the effect of weather on DIR was weak, but a statistical association could be found, with both T and H contributing positively to DIR in this segment. The results of the three category models were basically the same.
The third segment was the period after full reopening and before vaccination, and the effect of AHR on DIR was significantly higher than that of TR (>250 miles). The first category
In the fourth segment, AHR was still negatively correlated with DIR, while the effect of vaccination was the most significant: its coefficient reached -1.05E+00, and this negative effect was even more pronounced in the first- and second-category models. No effect of TR (>250 miles) on DIR was found.
$$
\begin{aligned}
\mathrm{DIR}_{s4} ={} & -5.87\text{E-02} \times \mathrm{AHR} - 1.05\text{E+00} \times \mathrm{AVD} + 1.38\text{E-04} \times \mathrm{T} + 1.16\text{E-05} \times \mathrm{H} \\
& - 1.89\text{E-04} \times \mathrm{WS} - 9.62\text{E-04} \times \mathrm{AP} - 1.81\text{E-03} \times \mathrm{PPTN} + 9.29\text{E-02}
\end{aligned}
$$
In the fifth segment, the negative effect of vaccination on DIR was higher in magnitude than the positive effect of TR (>250 miles) on DIR (5.44E-01 > 2.26E-01), and both were higher than the inhibitory effect of AHR on DIR. The results of the second-category model were the closest to the overall model. In the first-category model the effect of vaccination was relatively slight, and DIR was mainly affected by TR (>250 miles).
$$
\begin{aligned}
\mathrm{DIR}_{s5} ={} & -4.51\text{E-02} \times \mathrm{AHR} + 2.26\text{E-01} \times \mathrm{TR}(>250\ \mathrm{miles}) - 5.44\text{E-01} \times \mathrm{AVD} - 2.32\text{E-04} \times \mathrm{T} + 7.59\text{E-05} \times \mathrm{H} \\
& - 7.08\text{E-04} \times \mathrm{WS} - 3.29\text{E-03} \times \mathrm{AP} - 2.54\text{E-03} \times \mathrm{PPTN} + 1.37\text{E-01}
\end{aligned}
$$
In the last segment, both vaccination and TR (>250 miles) had significantly higher effects on DIR than AHR. As in the previous two periods, the regression coefficient for vaccination was higher than that of TR (>250 miles) (2.32E+00 > 2.26E+00). In the first-category model, DIR was again dominated by TR (>250 miles), while the effect of vaccination on DIR was not significant. In the second-category model, by contrast, DIR was significantly affected by vaccination (its coefficient reached 6.44E+00) but not by TR.
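The fifth-segment equation is an ordinary linear predictor, so each coefficient can be read as the change in predicted DIR per unit change in that variable with the others held fixed. The sketch below encodes the reported fifth-segment coefficients to show this reading. The dictionary keys, function names and the size of the illustrative change (0.01) are assumptions, and because the units of the variables are not restated here the numbers show only the arithmetic, not an epidemiological forecast.

```python
# Coefficients of the fifth-segment equation as reported in the text.
S5_COEFFICIENTS = {
    "AHR": -4.51e-02,
    "TR_250": 2.26e-01,   # TR (>250 miles)
    "AVD": -5.44e-01,
    "T": -2.32e-04,
    "H": 7.59e-05,
    "WS": -7.08e-04,
    "AP": -3.29e-03,
    "PPTN": -2.54e-03,
}
S5_INTERCEPT = 1.37e-01


def predict_dir(values, coefficients=S5_COEFFICIENTS, intercept=S5_INTERCEPT):
    """Evaluate the linear equation for one set of variable values."""
    return intercept + sum(coefficients[name] * values[name] for name in coefficients)


def marginal_change(name, delta, coefficients=S5_COEFFICIENTS):
    """Change in predicted DIR when one variable moves by delta, others fixed."""
    return coefficients[name] * delta


# With every variable at zero the prediction equals the intercept.
baseline = predict_dir({name: 0.0 for name in S5_COEFFICIENTS})

# Hypothetical equal-sized changes in vaccination and long-distance travel.
avd_effect = marginal_change("AVD", 0.01)
travel_effect = marginal_change("TR_250", 0.01)
print(baseline, avd_effect, travel_effect)
```

For equal-sized changes the vaccination term moves the prediction by more than the travel term, which is the comparison of 5.44E-01 with 2.26E-01 made in the text.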
The results of the models under the 3-day and 10-day incubation periods are shown in the S1-S3 Tables. The R-squared of the models under these two incubation periods was generally lower than that of the models under the 7-day incubation period.
Moreover, the fitted results for AHR under the 3-day incubation period were not stable enough, and the 10-day incubation period model may underestimate the effect of vaccination compared with the 7-day model. It is also worth noting that in the latter three segments, the effect of vaccination on DIR under the 7-day incubation period was consistently higher than that under the 3-day incubation period.
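Comparing candidate incubation periods by model fit can be illustrated with a one-variable sketch that computes R-squared at each lag. This is an assumption-laden simplification of the comparison described above, not the authors' procedure: it uses a single predictor, a synthetic series built so that the outcome responds to the predictor exactly 7 days later, and hypothetical function names.

```python
import random


def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def r_squared_at_lag(predictor, outcome, lag):
    """Pair predictor[t] with outcome[t + lag] and return the fit."""
    return r_squared(predictor[:len(predictor) - lag], outcome[lag:])


# Synthetic series (an assumption for illustration): the outcome depends on
# the predictor observed 7 days earlier; the first 7 days are filler values.
rng = random.Random(0)
predictor = [rng.random() for _ in range(200)]
outcome = [5.0] * 7 + [5.0 - 2.0 * predictor[t - 7] for t in range(7, 200)]

fit_by_lag = {lag: r_squared_at_lag(predictor, outcome, lag) for lag in (3, 7, 10)}
best_lag = max(fit_by_lag, key=fit_by_lag.get)
print(fit_by_lag, best_lag)
```

On real data the differences in fit between lags are far smaller than in this constructed example, so the comparison is better read alongside coefficient stability, as the text does for AHR and AVD.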

Discussion
Our study explored the roles of epidemic-spread-related behaviors and vaccination status in different segments of the COVID-19 epidemic, and used panel-model clustering and linear regression to explore how these roles differ across the spatial dimension. We also compared the model fit under different incubation periods to identify the optimal incubation period.
With the normalization of the epidemic, the ways to prevent transmission have become well known. The significant negative correlations of AHR and AVD with DIR and the significant positive correlation between TR and DIR in the unsegmented regression model are all consistent with the view that the most effective means of epidemic prevention are staying at home, reducing the number of trips (especially long-distance interstate travel) and vaccination (Table 2). The first segment was March 2020, a period when the epidemic had not yet fully taken hold. During this period, because of insufficient awareness of COVID-19 and limited testing, there might not have been enough recorded cases to observe the real effect of travel. Even so, a slight association between AHR and DIR was found. The impact of travel became significant in the second segment, when all 50 states had become acutely aware of the dangers of COVID-19 and enacted stay-at-home orders. After about a month of quarantine, the U.S. gradually reopened even though the outbreak had not been effectively contained [38]. Thus, in the third segment, we saw a significant effect of both AHR and travel on the epidemic, and AHR played a more important role than in the second segment. This might be because increasing awareness of COVID-19 had reduced the frequency of interstate travel, so that the more effective prevention behavior, staying at home, became the main influencing factor of DIR at this stage.
In the middle and late stages of the epidemic, vaccines became available, and Americans began to be vaccinated, voluntarily or compulsorily, from December 13, 2020 [30]. In the latter three segments, the role of vaccination gradually became dominant, surpassing the effect of epidemic-related behaviors on DIR. According to previous studies, the effectiveness of the vaccine in the United States could reach 70-90% (within one month of vaccination) [41,42]. It is worth noting that the regression results differ across these three segments, because the speed of epidemic spread changed continually as mutant strains emerged [43,44]. Compared with the fifth segment, which was dominated by the Delta strain, the vaccine was more effective in controlling the epidemic in the fourth segment. This may reflect the rapid increase in vaccine coverage from 0 to around 30-40% in these three months [30], which produced a marked reduction in an epidemic dominated by non-VOC strains. In our model, AVD had a significant effect on DIR that far exceeded the effect of epidemic-related behaviors, although the latter still contributed significantly to reducing DIR. The fifth-segment model showed a decline in the impact of AVD. On the one hand, this might be because the effectiveness of the vaccine gradually decreased, dropping to 47% after five months [42]. On the other hand, the Delta strain, which gradually came to dominate the epidemic, was less susceptible to the vaccines [45,46]. In this period, the effect of travel on DIR rose again but remained lower than that of AVD. The last stage was the period when the epidemic was dominated by Omicron, while the government was gradually lifting control policies, making long-distance travel easier and more frequent [47]. This corresponded to our results that TR (>250 miles) played a large role in the development of DIR during this period, while the inhibition of DIR by vaccination was relatively more pronounced, reaching its highest coefficient in the second-category model (6.44E+00).
After considering the regression models for the different cluster groups, we found that the third-category model was the closest to the overall model. The 7 states in the second category often gave noteworthy results: they were more sensitive to the effect of vaccination in the later period of the epidemic. Specifically, in the fourth and sixth segments, the sensitivity of DIR to the vaccine even masked the effects of AHR and TR. By contrast, the states in the first category differed from the overall model in the final segment, where the effect of the vaccine was significantly lower than that of TR (>250 miles), the exact opposite of the second category. This phenomenon could not have been detected from the overall model alone. Differences in states' circumstances can indeed cause the effects of the various factors to vary.
Based on the results for the different incubation periods shown in the S1-S3 Tables, the 7-day incubation period model was robust in most periods. We also found that, in the last segment, the coefficient for AVD in the 7-day model was higher than that in the 3-day model, which in turn was higher than that in the 10-day model, supporting the view of Dr. Post [35]. This might be related to the shorter incubation period of the Omicron strain; a previous study suggested that its lag effect was about half that of the original strain [48]. However, our results did not fully support this view: not only was the fit of the 7-day model better than that of the 3-day model, but the coefficient of AVD under the 7-day model was also generally higher than that under the 3-day model.
As for weather factors, the results remained relatively controversial [20]. The unsegmented results, whether at 3, 7 or 10 days of incubation, suggested a negative effect of T, H, WS, AP and PPTN on DIR, although the association was very weak. The segmented results, however, showed a different pattern. In the first and second segments, corresponding to March to July 2020, as temperatures moved from winter lows to summer highs, T acted positively on DIR. According to earlier research, a rise in temperature had a positive effect on incidence, which peaked at 60.8-82.4°F [16]. Our results might fit this pattern because the U.S. is located in the northern hemisphere and most states have an average temperature lower than 82.4°F during the first and second segments (Table 1). However, excessively high summer temperatures might inhibit the spread of the virus to a certain degree, which corresponds to the results of the third-segment model (covering mainly the hottest part of summer and autumn), in which T and DIR were negatively correlated.
Our research did not find a relationship between humidity and DIR in the main model; even where H was slightly significant in some other models, its coefficient was small enough to be ignored. WS had a certain negative effect on DIR, which might be explained by circulating air carrying away virus that would otherwise linger in one place, diluting its density and reducing transmission. Rainfall might have a similar effect. Especially in the late stage of the epidemic, which was dominated by variant strains, the negative coefficients of WS and PPTN on DIR in the fifth and sixth segments were even somewhat larger in magnitude than in the main unsegmented model.
Overall, compared with daily epidemic-related behaviors and vaccination, the effect of weather on DIR was not of the same order of magnitude. Because weather is a controversial factor, however, we still controlled for the weather indicators in the model, and the results might provide some support for future research. In addition, our study focused only on daily incidence as the dependent variable, and the set of influencing factors considered also had certain limitations. We therefore expect more studies that model dynamic effects to explore in depth the various factors affecting the development of the epidemic.

Conclusion
Staying at home and getting vaccinated were particularly important behaviors for inhibiting the spread of COVID-19 in the U.S., even in the period dominated by Omicron. Travel, especially long-distance interstate travel, was a significant risk factor for the spread of the epidemic. The spread might also be associated with weather, albeit to a lesser extent.